Copula-based Markov chain logistic regression modeling on binomial time series data

A first-order autoregressive time series model with binomially distributed random variables has been developed using the copula-based Markov chain approach. Because the model is still built on conditional probabilities, covariates can also be included as independent variables. A binomially distributed time series response with continuous independent variables can be modelled using a copula-based Markov chain in which the probability of success is expressed through a logit model. This study proposes a copula-based Markov chain logistic regression model with binomial marginals and a joint distribution function built through a copula function. It also aims to estimate the parameters of the model: the logistic regression parameters, which describe the relationship between the dependent and independent variables, and the copula parameter, which describes the time dependency. The bivariate copula functions considered are Clayton, Gumbel and Frank, and the parameters are estimated by Maximum Likelihood Estimation (MLE). Simulations were carried out to assess the efficiency of the parameter estimation and its asymptotic behaviour. Based on the simulation results, MLE provides accurate estimates for the copula-based Markov chain logistic regression model. In addition, the model not only captures the relationship between the independent and dependent variables but also provides an estimate of the time dependency of the dependent variable. The following are some of the proposed approach's highlights:
• This method proposes a binomial time series data model with covariate variables by combining the logistic regression model and the first-order Markov chain model.
• Parameter estimation in this model uses the Maximum Likelihood Estimation method.
• The model makes it possible to examine both the relationship between variables and the time dependency.


Specifications Table
Subject area: Mathematics and Statistics
More specific subject area: Temporal Statistics, Count Time Series Data, Copula Modelling
Name of your method: Copula-Based Markov Chain Logistic Regression Model
Name and reference of original method: Copula-based Markov zero-inflated count time series models with application
Resource availability: Not applicable
Introduction
Kenzie (1986) discussed time series modelling with discrete random variables and used a first-order autoregressive structure, called INAR(1), to estimate the correlation between two adjacent observations [1]. Since discrete time series data have more complex features, the INAR(1) model is more complicated than the AR(1) model for continuous-valued time series [1,2]. Karlis & Pedeli (2013) stated that a limitation of the INAR(1) specification is that it allows only for positive correlation between the two series [3]. The INAR(1) model with binomial time series data is known as the binomial AR(1) model. Binomial time series models have been applied successfully to several real problems, such as in the fields of finance and industry [4,5]. Huang & Emura (2022) used a copula-based Markov chain to model binomial time series and to overcome the complexity and limitations of the binomial AR(1) model [5]. Apart from that, the copula method has several advantages: it captures dependencies between two time series, it can be used flexibly for discrete bivariate distributions, and it allows for negative correlation [3]. Also, the marginals need not be in the same family, and the model is easier to extend to an n-dimensional multivariate class by employing the vine copula [6].
The binomial time series models in [1,4,5] are proposed for univariate binomial time series data with probability $\pi$. However, some data come with covariates, in which case the probability can be described as a function of those covariates. Wu & Cui (2012) proposed a semiparametric method to estimate the parameters of a linear model that expresses the probability of success through the logit model [7], while Dunsmuir & He (2016) developed an approach that uses estimates based on the one-dimensional marginal distribution through the generalized linear mixed model (GLMM) method [8]. Both approaches propose parameter estimation in which time dependencies are expressed in latent processes, and both produce consistent and asymptotically normal regression parameter estimates.
Even though the GLMM method can easily estimate the parameters of a binomial time series model, time dependencies are neglected and are expressed only in latent processes. Alqawba & Diawara (2020) stated that ignoring time dependencies can give inaccurate results and proposed a discrete time series model through a copula-based joint distribution of zero-inflated count time series observations [9]. The use of a copula function to construct the joint distribution function in Markov chains was first introduced by Joe [10]. Many studies define copula-based predictive models as conditional expectations of the dependent variable given the independent variables [11-15]. Some important advantages of the copula-based Markov model are that it can avoid some of the tight distributional assumptions on the marginal variables and that it can be extended to non-stationary processes through time-varying parameters within the univariate margins of discrete distributions. In addition, the copula function can handle n-dimensional joint distributions more easily than multivariate joint distribution functions of a specific family.
In this study, we modify the copula-based Markov zero-inflated count time series model for binomial time series in which covariates are modelled through the probability of success. The proposed model is called the copula-based Markov chain logistic regression model. Copula functions can capture time dependencies, and the models used are the Clayton, Gumbel and Frank copulas. These three copulas belong to the Archimedean family, which is frequently used in a variety of applications because of its closed forms, mathematical tractability and flexibility in capturing strong dependencies [16-18]. The probability of success of the marginal variable is determined from the inverse logit model based on the estimated linear model parameters. We propose a Maximum Likelihood Estimation procedure for estimating the model parameters, both the logit parameters and the time dependency. First, we present the computation of the parameter estimates and a simulation to assess the performance of the proposed estimation method. We then apply the proposed method to model the relationship between climatic factors and influenza incidence in Singapore in 2012-2013.

Copula
A copula is a function that links a joint multivariate distribution function, $F(x_1, \dots, x_n)$, with its one-dimensional marginal distribution functions, $F_1(x_1), \dots, F_n(x_n)$, where each marginal distribution function is uniformly distributed over the range $[0, 1]$. A copula that joins only two marginal distributions is called a bivariate or 2-dimensional copula. The basis of copula theory is Sklar's theorem: a bivariate copula $C$ is a bivariate distribution function with uniform marginal distributions over the interval $[0, 1]$, and a joint distribution function can be written as $F(x_1, x_2) = C(F_1(x_1), F_2(x_2))$. Sklar's theorem thus explains the role that the copula plays in the relationship between a bivariate distribution function and its univariate marginal distribution functions [19].
There are several methods for constructing bivariate copulas, including the Archimedean family, which is widely used and is an important family in copula-based modelling. The copula models used in this paper are the one-parameter Archimedean copulas, namely the Clayton, Gumbel and Frank copulas.

Clayton copula
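The Clayton copula is an asymmetric Archimedean copula that exhibits greater dependency in the negative (lower) tail than in the positive (upper) tail. With $u$ and $v$ denoting the uniform marginals, the Clayton copula has the following distribution function:
$$C(u, v; \alpha) = \left(u^{-\alpha} + v^{-\alpha} - 1\right)^{-1/\alpha} \tag{1}$$
and the density is
$$c(u, v; \alpha) = (1 + \alpha)(uv)^{-(\alpha + 1)}\left(u^{-\alpha} + v^{-\alpha} - 1\right)^{-(2\alpha + 1)/\alpha} \tag{2}$$
where $\alpha \in [-1, \infty) \setminus \{0\}$. The Clayton copula has an upper tail dependency $\lambda_U = 0$ and a lower tail dependency $\lambda_L = 2^{-1/\alpha}$. The conditional distribution of the Clayton copula is given by:
$$C(u \mid v; \alpha) = \frac{\partial C(u, v; \alpha)}{\partial v} = v^{-(\alpha + 1)}\left(u^{-\alpha} + v^{-\alpha} - 1\right)^{-(\alpha + 1)/\alpha} \tag{3}$$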

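Gumbel copula
In contrast to the Clayton copula, the Gumbel copula has a lower tail dependency $\lambda_L = 0$ and an upper tail dependency $\lambda_U = 2 - 2^{1/\alpha}$; hence this copula shows greater dependency in the positive (upper) tail. With $u$ and $v$ denoting the uniform marginals and $\alpha \geq 1$, the distribution function of the Gumbel copula is
$$C(u, v; \alpha) = \exp\left\{-\left[(-\ln u)^{\alpha} + (-\ln v)^{\alpha}\right]^{1/\alpha}\right\} \tag{4}$$
and the density function is
$$c(u, v; \alpha) = \frac{C(u, v; \alpha)}{uv}\left[(-\ln u)(-\ln v)\right]^{\alpha - 1} A^{1/\alpha - 2}\left(A^{1/\alpha} + \alpha - 1\right), \qquad A = (-\ln u)^{\alpha} + (-\ln v)^{\alpha}. \tag{5}$$
The conditional distribution of the Gumbel copula is given by:
$$C(u \mid v; \alpha) = \frac{\partial C(u, v; \alpha)}{\partial v} = C(u, v; \alpha)\,\frac{(-\ln v)^{\alpha - 1}}{v}\,A^{1/\alpha - 1} \tag{6}$$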

Frank copula
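The Frank copula exhibits a parallel correlation structure and is the only one of the three Archimedean copulas considered here that has a symmetrical shape. Both its positive (upper) and negative (lower) tail dependencies are zero. With $u$ and $v$ denoting the uniform marginals and $\alpha \neq 0$, the distribution function of the Frank copula is:
$$C(u, v; \alpha) = -\frac{1}{\alpha}\ln\left[1 + \frac{\left(e^{-\alpha u} - 1\right)\left(e^{-\alpha v} - 1\right)}{e^{-\alpha} - 1}\right] \tag{7}$$
and the density function is
$$c(u, v; \alpha) = \frac{\alpha\left(1 - e^{-\alpha}\right)e^{-\alpha(u + v)}}{\left[\left(1 - e^{-\alpha}\right) - \left(1 - e^{-\alpha u}\right)\left(1 - e^{-\alpha v}\right)\right]^{2}} \tag{8}$$
The conditional distribution of the Frank copula is given by:
$$C(u \mid v; \alpha) = \frac{\partial C(u, v; \alpha)}{\partial v} = \frac{e^{-\alpha v}\left(e^{-\alpha u} - 1\right)}{\left(e^{-\alpha} - 1\right) + \left(e^{-\alpha u} - 1\right)\left(e^{-\alpha v} - 1\right)} \tag{9}$$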
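As a rough illustration, the three copula distribution functions can be written in a few lines of Python using their standard closed forms. This is only a sketch, not the authors' R implementation: the function names are chosen here for illustration, the Clayton version is restricted to $\alpha > 0$, and only the standard library is assumed.

```python
import math


def clayton_cdf(u, v, alpha):
    """Clayton copula C(u, v; alpha), written here for alpha > 0 only."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def gumbel_cdf(u, v, alpha):
    """Gumbel copula C(u, v; alpha) for alpha >= 1."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    s = abs(math.log(u)) ** alpha + abs(math.log(v)) ** alpha
    return math.exp(-(s ** (1.0 / alpha)))


def frank_cdf(u, v, alpha):
    """Frank copula C(u, v; alpha) for alpha != 0 (expm1/log1p for stability)."""
    num = math.expm1(-alpha * u) * math.expm1(-alpha * v)
    return -math.log1p(num / math.expm1(-alpha)) / alpha


def tail_dependence(family, alpha):
    """Return the (lower, upper) tail-dependence coefficients."""
    if family == "clayton":
        return 2.0 ** (-1.0 / alpha), 0.0
    if family == "gumbel":
        return 0.0, 2.0 - 2.0 ** (1.0 / alpha)
    return 0.0, 0.0  # Frank: no tail dependence


if __name__ == "__main__":
    for name, cdf, a in [("clayton", clayton_cdf, 2.0),
                         ("gumbel", gumbel_cdf, 2.0),
                         ("frank", frank_cdf, 5.0)]:
        print(name, cdf(0.4, 0.6, a), tail_dependence(name, a))
```

A quick sanity check for any copula is that $C(u, 1) = u$ and $C(1, v) = v$, which all three functions satisfy up to floating-point error.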

Binomial logistic regression
Suppose the response vector $Y = (Y_1, \dots, Y_n)^T$ is assumed to have a binomial distribution. For binomial data, the observed response for the $i$th observation, $i = 1, 2, \dots, n$, is the proportion denoted by $y_i / n_i$. The response corresponding to the $i$th observation follows the binomial distribution $B(n_i, \pi_i)$, where $\pi_i$ is the probability of success (or response probability) and $n_i$ is the total number of trials. Therefore, the expected value of the response variable is $E(Y_i) = n_i \pi_i$ and its variance is $\mathrm{Var}(Y_i) = n_i \pi_i (1 - \pi_i)$.
In logistic regression, the quantity modelled is the probability of success for the $i$th observation, $\pi_i = \pi(x_i)$, which takes values in the range (0, 1). To ensure that the probability stays between 0 and 1, a logistic transformation is used to link it to the linear predictor. The logistic or logit transformation of the probability of success $\pi_i$ is written as [16]
$$\mathrm{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \tag{10}$$
and can be rearranged into the success probability equation shown below:
$$\pi_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})} \tag{11}$$
MethodsX 12 (2024) 102509
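The logit link and its inverse can be illustrated with a short Python sketch. The helper names (`logit`, `inv_logit`, `success_probability`) are hypothetical and chosen for this example only; the coefficient vector in the usage lines is the one used later in the simulation study, and the covariate values are made up.

```python
import math


def logit(p):
    """Logit transformation of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit, written to avoid overflow for large |eta|."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def success_probability(x, beta):
    """Success probability for covariates x and coefficients beta.

    beta[0] is the intercept; beta[1:] pairs with the entries of x.
    """
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return inv_logit(eta)


if __name__ == "__main__":
    beta = [-2.0, 0.3, -0.5]   # coefficients used in the simulation study
    x = [1.0, 0.5]             # made-up covariate values
    print(success_probability(x, beta))
```

Whatever the value of the linear predictor, the returned probability always lies strictly between 0 and 1, which is the reason for using the logit link.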

Copula-based Markov chain autoregressive logistic regression model
The general form of the first-order Markov model with the transition probability is defined as
$$Y_t = h(\varepsilon_t; Y_{t-1}) \tag{12}$$
where $\varepsilon_t$ is an i.i.d. continuous latent stochastic process and $h(\cdot)$ is assumed to be an increasing function of $\varepsilon_t$ for $t = 1, \dots, T$. Therefore, the observed values of $Y_t$ depend on the past only through $Y_{t-1}$. If the $Y_t$ process is continuous, then there is a simple stochastic representation for the Markov model. However, for discrete processes, the stochastic representation of the model becomes more complicated.
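Let $Y_t$ be a discrete time series following a first-order Markov chain. By the probability chain rule and the Markov property, the multivariate joint probability distribution of $Y_1, \dots, Y_T$ is given as
$$\Pr(Y_1 = y_1, \dots, Y_T = y_T) = \Pr(Y_1 = y_1)\prod_{t=2}^{T}\Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}) \tag{13}$$
The transition probability depends on the joint probability function of $(Y_t, Y_{t-1})$ and can be determined using the copula function. Thus, the joint cumulative function with margins $F_t$ and $F_{t-1}$ is expressed as
$$F(y_t, y_{t-1}) = C(F_t(y_t), F_{t-1}(y_{t-1}); \alpha) \tag{14}$$
where $C(\cdot, \cdot; \alpha)$ is a bivariate copula function with parameter $\alpha$.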
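Here $x_t = (1, x_{1t}, \dots, x_{pt})$ is the covariate vector corresponding to the probability parameter $\pi_t$, and the parameter vector $\beta = (\beta_0, \beta_1, \dots, \beta_p)$ contains the unknown marginal regression coefficients. Therefore, the transition probability is given as
$$\Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}) = \frac{\Delta_t}{f(y_{t-1}; n_{t-1}, \pi_{t-1})} \tag{15}$$
where
$$\Delta_t = C(F_t(y_t), F_{t-1}(y_{t-1}); \alpha) - C(F_t(y_t), F_{t-1}(y_{t-1} - 1); \alpha) - C(F_t(y_t - 1), F_{t-1}(y_{t-1}); \alpha) + C(F_t(y_t - 1), F_{t-1}(y_{t-1} - 1); \alpha)$$
is the joint probability of $(Y_t, Y_{t-1}) = (y_t, y_{t-1})$ and $F_t(y) = F(y; n_t, \pi_t)$ [16-18].
Let $Y = (Y_1, \dots, Y_T)^T$ be a vector of binomial time series data observed at $T$ time points, and let $X = (x_1, \dots, x_T)^T$ be the $T \times (p + 1)$ covariate matrix, where $x_t = (1, x_{1t}, \dots, x_{pt})$ for $t = 1, \dots, T$ and $p$ is the number of covariates. The binomial logistic regression model for discrete time series data is defined in Eq. (11), where $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is the coefficient vector of the logistic regression model to be estimated. From this regression model, $\pi_t = \exp(x_t\beta) / (1 + \exp(x_t\beta))$ is the success probability for the marginal variable $Y_t$, which has a binomial distribution. Therefore, the probability function and the distribution function of $Y_t$ are
$$f(y_t; n_t, \pi_t) = \binom{n_t}{y_t}\pi_t^{y_t}(1 - \pi_t)^{n_t - y_t} \tag{16}$$
and
$$F(y_t; n_t, \pi_t) = \sum_{k=0}^{y_t}\binom{n_t}{k}\pi_t^{k}(1 - \pi_t)^{n_t - k} \tag{17}$$
It is also assumed in this section that the discrete time series $\{Y_t \mid x_t; \beta,\ t = 1, 2, \dots, T\}$ is a first-order Markov process with discrete state space, that is, $Y_t \mid Y_{t-1}, x_t; \beta$ for $t = 1, 2, \dots, T$. Under this assumption, the probabilistic properties of the time series are determined by the joint distribution of $Y_t$ and $Y_{t-1}$, denoted by $F(y_t, y_{t-1} \mid x_t, \beta)$. This joint distribution can be determined from the bivariate copula model, so that every pair $(y_t, y_{t-1})$ in the state space has a joint cumulative distribution as in Eq. (14), where $C(\cdot, \cdot; \alpha)$ is a copula function that does not vary with the covariates and $\alpha$ is the unknown copula parameter.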
For continuous time series, Sklar's theorem implies that $F(y_t, y_{t-1} \mid x_t, \beta)$ can be represented by the marginal conditional distribution function $F(y_t \mid x_t, \beta)$ of $Y_t$ and a unique copula function $C(\cdot, \cdot; \alpha)$. However, when the marginal distribution function $F(\cdot)$ is discrete, the copula in the joint distribution $F(\cdot, \cdot)$ is uniquely defined only at certain points. Because the logistic regression model for $\pi_t$ depends on the continuous covariate $x_t$, the values of $F(y_t \mid x_t, \beta)$ can be expanded from a number of discrete points into intervals. Thus, it can be ensured that the copula function is uniquely determined over the whole region of possible values of $F(y_t \mid x_t, \beta)$. This is summarized in the following statement:
Proposition 1. Suppose $Y = (Y_1, \dots, Y_T)^T$ is a binomial time series data vector, and $Y_t$ is characterized by binomial logistic regression with marginal distribution $F(y_t \mid x_t; \beta)$ and satisfies the first-order Markov property
$$\Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}, \dots, Y_1 = y_1) = \Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}).$$
Suppose that there is also a bivariate copula $C(\cdot, \cdot; \alpha)$ such that every pair $(y_t, y_{t-1})$ in the state space has the joint cumulative distribution
$$F(y_t, y_{t-1} \mid x_t, \beta) = C(F(y_t \mid x_t; \beta), F(y_{t-1} \mid x_{t-1}; \beta); \alpha),$$
where $C(\cdot, \cdot; \alpha)$ is a copula function that does not vary with the covariates, and $\alpha$ is the unknown copula parameter.
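To make the construction concrete, the sketch below computes a one-step transition probability for binomial margins: the probability of the pair $(y_t, y_{t-1})$ is obtained from the copula evaluated at the four corner points of the binomial distribution functions, and is then divided by the marginal probability of $y_{t-1}$. The Frank copula is used only as an example, and all function names and numerical values are assumptions made for illustration.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Binomial cumulative distribution function, with F(-1) = 0."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


def frank_cdf(u, v, alpha):
    """Frank copula C(u, v; alpha) for alpha != 0."""
    num = math.expm1(-alpha * u) * math.expm1(-alpha * v)
    return -math.log1p(num / math.expm1(-alpha)) / alpha


def transition_prob(y_t, y_prev, n_t, n_prev, p_t, p_prev, alpha,
                    copula=frank_cdf):
    """Pr(Y_t = y_t | Y_{t-1} = y_prev) for a copula-based Markov chain."""
    u1, u0 = binom_cdf(y_t, n_t, p_t), binom_cdf(y_t - 1, n_t, p_t)
    v1 = binom_cdf(y_prev, n_prev, p_prev)
    v0 = binom_cdf(y_prev - 1, n_prev, p_prev)
    # Joint probability of the pair (y_t, y_prev) as a copula "rectangle".
    delta = (copula(u1, v1, alpha) - copula(u1, v0, alpha)
             - copula(u0, v1, alpha) + copula(u0, v0, alpha))
    return delta / binom_pmf(y_prev, n_prev, p_prev)


if __name__ == "__main__":
    # Made-up example: 10 trials, previous count 4, Frank parameter 2.
    probs = [transition_prob(k, 4, 10, 10, 0.3, 0.4, 2.0) for k in range(11)]
    print(probs, sum(probs))
```

Because the copula terms telescope when summed over all values of $y_t$, the transition probabilities for a fixed previous value add up to one, which is a useful check on any implementation.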

Parameter estimation
The parameter vector of the copula-based Markov chain logistic regression model for binomial data is estimated by the maximum likelihood estimation (MLE) method. This method is easy to apply when the selected copula family has a closed form, and it supports model selection through the log-likelihood function. By using the transition probability in Eq. (15) and taking the logarithm of the likelihood function, the following log-likelihood function is obtained:
$$\ell(\beta, \alpha) = \log f(y_1; n_1, \pi_1) + \sum_{t=2}^{T}\left[\log \Delta_t - \log f(y_{t-1}; n_{t-1}, \pi_{t-1})\right] \tag{23}$$
where $\Delta_t$ is the joint probability of $(Y_t, Y_{t-1}) = (y_t, y_{t-1})$ obtained from the copula and $f$ is the binomial probability function. Based on the approach in [9], the estimates of the parameters $\beta$ and $\alpha$ in the binomial AR(1) model are obtained simultaneously by maximizing the log-likelihood in Eq. (23), or
$$(\hat\beta, \hat\alpha) = \arg\max_{\beta, \alpha}\ \ell(\beta, \alpha)$$
where $\alpha$ is the copula parameter and $\beta$ is the logistic regression marginal parameter. When estimating the parameters, the log-likelihood function $\ell(\beta, \alpha)$ is optimized, which also yields a Hessian matrix. The Hessian matrix gives the observed Fisher information matrix of the MLE $(\hat\beta, \hat\alpha)$, which can be used to calculate standard errors. To obtain the parameter estimates $(\hat\beta, \hat\alpha)$ using MLE, the score function is required, which is stated in the following lemma:
Lemma 1.
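Suppose the likelihood function of the copula-based Markov chain logistic regression model for binomial data is given by $L(\beta, \alpha)$, with log-likelihood $\ell(\beta, \alpha) = \log L(\beta, \alpha)$, and $\hat\theta = (\hat\beta, \hat\alpha)$ is the maximum likelihood estimator. Then the score function of the copula-based Markov chain logistic regression model is
$$\frac{\partial \ell(\beta, \alpha)}{\partial \beta} = \frac{f'_{1,\beta}}{f_1} + \sum_{t=2}^{T}\left[\frac{\Delta'_{t,\beta}}{\Delta_t} - \frac{f'_{t-1,\beta}}{f_{t-1}}\right], \qquad \frac{\partial \ell(\beta, \alpha)}{\partial \alpha} = \sum_{t=2}^{T}\frac{\Delta'_{t,\alpha}}{\Delta_t},$$
where $f_t = f(y_t; n_t, \pi_t)$ and $f'_{t,\beta} = \partial f_t / \partial \beta$, and $\Delta'_{t,\beta}$ and $\Delta'_{t,\alpha}$ are the partial derivatives of $\Delta_t$ with respect to the marginal parameter $\beta$ and the copula parameter $\alpha$, respectively. $C(\cdot, \cdot; \alpha)$ is a bivariate copula function; hence the conditional copula function $C(\cdot \mid \cdot; \alpha)$ and the partial derivative $\Delta'_{t,\alpha}$ depend on the copula function used.
Proof of Lemma 1: The log-likelihood function of the copula-based Markov chain logistic regression model for binomial data is
$$\ell(\beta, \alpha) = \log f_1 + \sum_{t=2}^{T}\left[\log \Delta_t - \log f_{t-1}\right].$$
First, we derive the partial derivative of the log-likelihood function with respect to the marginal parameter $\beta$. Let
$$\Delta_t = C(F_t(y_t), F_{t-1}(y_{t-1}); \alpha) - C(F_t(y_t), F_{t-1}(y_{t-1} - 1); \alpha) - C(F_t(y_t - 1), F_{t-1}(y_{t-1}); \alpha) + C(F_t(y_t - 1), F_{t-1}(y_{t-1} - 1); \alpha).$$
Its partial derivative with respect to the marginal parameter $\beta$ is given by
$$\Delta'_{t,\beta} = D_t(y_t, y_{t-1}) - D_t(y_t, y_{t-1} - 1) - D_t(y_t - 1, y_{t-1}) + D_t(y_t - 1, y_{t-1} - 1),$$
$$D_t(a, b) = C(F_{t-1}(b) \mid F_t(a); \alpha)\,\frac{\partial F_t(a)}{\partial \beta} + C(F_t(a) \mid F_{t-1}(b); \alpha)\,\frac{\partial F_{t-1}(b)}{\partial \beta},$$
where $C(\cdot \mid \cdot; \alpha)$, known as the conditional copula function, is the partial derivative of the bivariate copula function $C(\cdot, \cdot; \alpha)$ with respect to the marginal distribution function in the conditioning argument.
For $f_t$, the marginal probability function of the binomial distribution given in Eq. (16), the partial derivative with respect to the marginal parameter $\beta$ is
$$f'_{t,\beta} = \frac{\partial f_t}{\partial \beta} = f_t\,(y_t - n_t\pi_t)\,x_t, \qquad \text{so that} \qquad \frac{\partial F_t(y)}{\partial \beta} = \sum_{k=0}^{y} f(k; n_t, \pi_t)\,(k - n_t\pi_t)\,x_t.$$
Therefore, the partial derivative of the log-likelihood function with respect to the marginal parameter $\beta$ is
$$\frac{\partial \ell(\beta, \alpha)}{\partial \beta} = \frac{f'_{1,\beta}}{f_1} + \sum_{t=2}^{T}\left[\frac{\Delta'_{t,\beta}}{\Delta_t} - \frac{f'_{t-1,\beta}}{f_{t-1}}\right].$$
The next step is to derive the partial derivative of the log-likelihood function with respect to the copula parameter $\alpha$. Since the probability mass function $f_t$ does not contain the parameter $\alpha$, $\partial f_t / \partial \alpha = 0$ and $\partial \ell(\beta, \alpha) / \partial \alpha$ is determined only by $\Delta_t$. The partial derivative of $C(u, v; \alpha)$ with respect to $\alpha$ depends on the copula model used. Therefore, in general, the partial derivative of the log-likelihood function of the copula-based Markov chain logistic regression model for binomial data with respect to the copula parameter $\alpha$ is obtained as follows:
$$\frac{\partial \ell(\beta, \alpha)}{\partial \alpha} = \sum_{t=2}^{T}\frac{\Delta'_{t,\alpha}}{\Delta_t}.$$
Hence there exists a root $(\hat\beta, \hat\alpha)$ of $\partial \ell(\beta, \alpha) / \partial(\beta, \alpha) = 0$, which gives the maximum likelihood estimate.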

Computation and simulation
This section discusses computation and simulation for parameter estimation of the copula-based Markov chain logistic regression model. Computations are set up to calculate the log-likelihood function and to determine the parameter values that maximize it. Simulations are carried out on generated data to assess the performance of the parameter estimation. Therefore, it is necessary to develop algorithms for parameter estimation and for data generation. The computational procedure to estimate the parameters is as follows:
1. Input the data: the dependent variable $Y$, the total sample $n$ and the independent variable $X$.
2. Define the success probability function $\pi_t$ of the dependent variable $Y_t$ based on the independent variable $x_t$ using Eq. (11).
3. Define the probability function $f(y_t; n_t, \pi_t)$ of the marginal variable using Eq. (16) with the success probability from step 2.
4. Define the cumulative distribution function $F(y_t; n_t, \pi_t)$ of the binomial marginal variables using Eq. (17).
5. Arrange the marginal distribution functions $F(y_t)$, $F(y_{t-1})$, $F(y_t - 1)$ and $F(y_{t-1} - 1)$.
6. Define the joint cumulative function using Eq. (14).
7. Define the log-likelihood function based on Eq. (23).
8. Set the initial values of the logistic regression parameters $\beta_0 = (0, 0, \dots, 0)$ and of the copula parameter $\alpha_0$.
9. Obtain the parameter estimates $\hat\beta$ and $\hat\alpha$ that produce the largest copula log-likelihood value.
The procedure for generating data follows [21], which generates data from a Markov process with a bivariate copula and a Poisson regression model. The simulation is carried out to assess the parameter estimation performance of the copula-based Markov chain logistic regression model on binomial time series data. For this purpose, the data are generated from the model with parameters $\beta$ and $\alpha$. In this study, data are generated with logistic regression parameters $\beta = (-2, 0.3, -0.5)$ and three Kendall's $\tau$ values, namely weak (0.2), medium (0.5) and strong (0.8). Data generation also considers three sample sizes, $T = 200$, 500 and 1000. For each scenario, the simulation is repeated 500 times. Computation and simulation are implemented in R based on Algorithms 1, 2 and 3.
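One way to generate such data is to sample each observation directly from the copula-based transition probabilities. The Python sketch below does this for the Clayton copula; it is not the algorithm of [21] or the authors' R code. The standard normal covariates, the fixed number of trials, the seed and all function names are assumptions made for this illustration.

```python
import math
import random


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_cdf_table(n, p):
    """Return [F(-1), F(0), ..., F(n)], so that table[k + 1] = F(k)."""
    table, total = [0.0], 0.0
    for k in range(n + 1):
        total += binom_pmf(k, n, p)
        table.append(min(total, 1.0))
    table[-1] = 1.0
    return table


def clayton_cdf(u, v, alpha):
    """Clayton copula for alpha > 0, evaluated in log space to avoid overflow."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    a, b = -alpha * math.log(u), -alpha * math.log(v)
    m = max(a, b)
    s = m + math.log(math.exp(a - m) + math.exp(b - m) - math.exp(-m))
    return math.exp(-s / alpha)


def simulate(T, beta, alpha, n_trials, rng, copula=clayton_cdf):
    """Simulate a binomial series of length T with copula-based Markov dependence."""
    # Assumption: covariates are independent standard normal draws.
    X = [[rng.gauss(0.0, 1.0) for _ in beta[1:]] for _ in range(T)]
    p = [inv_logit(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
         for row in X]
    support = range(n_trials + 1)
    prev_table = binom_cdf_table(n_trials, p[0])
    # The first observation is drawn from its binomial margin.
    weights = [prev_table[k + 1] - prev_table[k] for k in support]
    y = [rng.choices(support, weights=weights)[0]]
    for t in range(1, T):
        table = binom_cdf_table(n_trials, p[t])
        v1, v0 = prev_table[y[-1] + 1], prev_table[y[-1]]
        # Pr(Y_t <= k, Y_{t-1} = y_{t-1}) for k = -1, 0, ..., n_trials.
        joint = [copula(u, v1, alpha) - copula(u, v0, alpha) for u in table]
        weights = [max(joint[k + 1] - joint[k], 0.0) for k in support]
        y.append(rng.choices(support, weights=weights)[0])
        prev_table = table
    return y, X


if __name__ == "__main__":
    series, covariates = simulate(200, [-2.0, 0.3, -0.5], 2.0, 20,
                                  random.Random(2024))
    print(series[:20])
```

Sampling from the transition probabilities keeps each $Y_t$ binomial with its own success probability while the copula parameter controls how strongly consecutive observations move together.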
For each of the previously selected copula models, namely Clayton, Gumbel and Frank, each Kendall's $\tau$ value is converted to the corresponding copula parameter value. The copula parameters for the three Kendall's $\tau$ values are presented in Table 1.
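The conversion from Kendall's $\tau$ to a copula parameter can be sketched with the standard relationships for each family: $\alpha = 2\tau/(1-\tau)$ for Clayton, $\alpha = 1/(1-\tau)$ for Gumbel, and a numerical inversion of $\tau = 1 - \frac{4}{\alpha}\left[1 - D_1(\alpha)\right]$ for Frank, where $D_1$ is the first-order Debye function. The Python below is an illustrative sketch; the function names, the Simpson integration and the bisection bounds are choices made here, not taken from the source.

```python
import math


def clayton_alpha(tau):
    """Clayton parameter from Kendall's tau: alpha = 2 tau / (1 - tau)."""
    return 2.0 * tau / (1.0 - tau)


def gumbel_alpha(tau):
    """Gumbel parameter from Kendall's tau: alpha = 1 / (1 - tau)."""
    return 1.0 / (1.0 - tau)


def debye1(a, steps=2000):
    """First-order Debye function D1(a) by composite Simpson integration."""
    def g(t):
        return 1.0 if t == 0.0 else t / math.expm1(t)
    h = a / steps
    s = g(0.0) + g(a)
    for i in range(1, steps):
        s += (4.0 if i % 2 else 2.0) * g(i * h)
    return s * h / 3.0 / a


def frank_tau(alpha):
    """Kendall's tau of the Frank copula for alpha > 0."""
    return 1.0 - 4.0 / alpha * (1.0 - debye1(alpha))


def frank_alpha(tau, lo=1e-6, hi=100.0, iterations=60):
    """Frank parameter for a given tau in (0, 1), found by bisection."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if frank_tau(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for tau in (0.2, 0.5, 0.8):
        print(tau, clayton_alpha(tau), gumbel_alpha(tau), frank_alpha(tau))
```

The same `frank_tau` function can be used in the other direction, to express an estimated Frank parameter as a Kendall's $\tau$ value.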
Table 2 presents the simulation results of the copula-based Markov chain logistic regression model with the Clayton copula. It can be seen that the means of the marginal parameter estimates for $T = 200$, 500 and 1000 are close to the actual parameter values. In addition, it can be seen that the MSE gets smaller as the sample size increases. For the time dependence, the parameter estimates under the Clayton copula model show that the means of the estimates are almost the same as the actual parameters, namely $\alpha = 0.5$, 2 and 8. The MSE tends to increase as the value of the parameter increases, but it decreases with increasing sample size.
Table 3 presents the simulation results for the estimated parameters of the copula-based Markov chain autoregressive logistic regression model with the Gumbel copula. The means of the estimated marginal parameters show that the estimates are not biased, with relatively small MSE values. For the copula parameter values $\alpha = 2$ and 6, the estimation produces an average that is almost the same as the actual copula parameter. For the copula parameter $\alpha = 20$, the mean values are smaller than the actual value, with a fairly large MSE. However, the average gets closer to the true value, and the MSE tends to decrease, as the sample size $T$ increases.
Similar to the simulation results of the copula-based Markov chain autoregressive logistic regression model for binomial data using the Clayton and Gumbel copulas, the simulation results using the Frank copula show that the means of the estimated marginal parameters are close to their true values, as shown in Table 4. The table also shows that the estimation of the marginal parameters gives unbiased results for $T = 200$, 500 and 1000. For the three parameter values used, namely $\alpha = 1.25$, 2 and 5, the simulation averages also approach the actual values, and the MSE approaches zero as the sample size increases.
From the simulation results of the copula-based Markov chain autoregressive logistic regression model for binomial data with the Clayton, Gumbel and Frank copulas, it can be concluded that parameter estimation using the MLE method provides unbiased estimates of $\beta$ and $\alpha$. This can be seen from the means of the estimated parameters, which are close to the actual values, and from the MSE values, which approach zero as the sample size grows.

Application of copula-based logistic autoregressive regression model
The copula-based logistic autoregressive regression model is implemented in R and applied to the human influenza data set for Singapore obtained from [22]. Singapore is a tropical country, and the weather is considered to play an important role in the transmission of influenza. It has been shown previously that the global dynamics of influenza outbreaks are determined by seasonal fluctuations in climatic factors such as temperature, amount of rainfall and relative humidity. The dependent data are the number of positive influenza samples and the number of monthly influenza surveillance specimens for the period from October 2011 to March 2014. The independent data are the monthly temperature (in degrees Celsius), the amount of precipitation (in mm/month), and vapor pressure (in hPa).
Fig. 1 shows time series plots of the dependent and independent variables from the case study of influenza data in Singapore. From the figure it can be seen that the proportion of confirmed influenza cases among the specimens examined was in the range of 0.20 %-0.65 %, with the lowest value in July 2013 and the highest in December 2013. The figure also shows that the periods in which the proportion reached its highest level, December or January, are also characterized by a decrease in air temperature and an increase in rainfall and humidity. Therefore, influenza cases can be suspected to have a negative relationship with temperature and a positive relationship with rainfall and air humidity.

Results and analyses
In this study it is assumed that the number of confirmed influenza cases among the specimens examined has a binomial distribution, so the marginal distribution used is binomial with a logistic regression model. For the time dependence structure, the Markov model uses the Clayton, Gumbel and Frank bivariate copula functions. The parameter estimation results of each model are presented in Table 5.
Table 5 shows a comparison of the binomial logistic regression (BLR) model and the copula-based Markov chain logistic regression model with the Clayton, Gumbel and Frank copulas. The results show that the probability parameter estimates $\hat\beta$ for the three copulas are nearly the same, with time dependency estimates of 0.14928, 2.67498 and 2.19288 for the Clayton, Gumbel and Frank copulas, respectively. The highest log-likelihood value is produced by the Frank copula, which also has the smallest MSE and MAPE. Therefore, based on the log-likelihood, MSE and MAPE values, the best copula for modelling the dependency structure of the human influenza data is the Frank copula.
After determining the best copula for the copula-based Markov chain logistic regression model, the next step is to interpret the parameter estimates. The parameters of the marginal model are interpreted first, followed by the copula parameter of the time dependence structure. The copula-based Markov chain logistic autoregressive regression model with the Frank copula shows a negative impact of increasing air temperature, since $\hat\beta_1 < 0$. For the precipitation variable, the coefficient is positive, with $\hat\beta_2 = 0.0007$, which can be interpreted as follows: the more rainfall, the higher the chance of an increase in influenza cases. In this model, $\hat\beta_3$ is also negative, indicating that increasing relative humidity reduces the probability of additional influenza cases.
The parameter values are less than one and relatively small, so the independent variables are expected to have no significant effect on the dependent variable. Because the focus of this study is on building the model and estimating its parameters, significance tests are not carried out and these parameters are retained in the model. Therefore, the marginal distribution used in determining the time dependence of the influenza data has the success proportion given as:
$$\hat\pi_t = \frac{e^{-0.00047 - 0.01528 x_{1t} + 0.00073 x_{2t} - 0.00036 x_{3t}}}{1 + e^{-0.00047 - 0.01528 x_{1t} + 0.00073 x_{2t} - 0.00036 x_{3t}}} \tag{33}$$
Furthermore, the dependency structure is interpreted in the form of autocorrelation. The best model for the influenza data is the copula-based Markov chain logistic regression model with the Frank copula function. Consequently, it can be concluded that the influenza data have a symmetrical time dependence and no dependence in the lower and upper tails. In Table 5, the copula parameter value for the Frank copula model is 2.19288, which corresponds to a Kendall's $\tau$ value of 0.2320. It can be stated that influenza activity in one month has a weak effect on influenza cases in the following month. Therefore, the conditional expectation of $Y_t$ given $Y_{t-1}$ in the copula-based Markov chain logistic regression model can be expressed by the following equation:
$$E(Y_t \mid Y_{t-1} = y_{t-1}) = \sum_{y_t = 0}^{n_t} y_t\,\frac{C(F(y_t), F(y_{t-1}); \alpha) - C(F(y_t), F(y_{t-1} - 1); \alpha) - C(F(y_t - 1), F(y_{t-1}); \alpha) + C(F(y_t - 1), F(y_{t-1} - 1); \alpha)}{f(y_{t-1})}$$
where $F$ is the binomial cumulative distribution, and $f$ the binomial probability function, with the probability of success in Eq. (33).
Fig. 2 displays the predicted values of the Markov chain logistic regression model with the Clayton, Gumbel and Frank copulas for the number of positive influenza cases in Singapore based on climatic factors. The real data are marked with a black line, while the predictions using the Clayton, Gumbel and Frank copulas are marked with red, blue and green lines, respectively. The fitted values of the copula-based Markov chain models are close to the actual values. However, the Markov chain model with the Frank copula works better than the models with the Clayton and Gumbel copulas.
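The one-step prediction can be illustrated with a short Python sketch that computes the conditional expectation of the current count given the previous one under a Frank copula. The coefficients and the copula parameter are the estimates reported in the text; the covariate values, the number of specimens and the function names are hypothetical and serve only to make the example runnable.

```python
import math


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def binom_pmf(k, n, p):
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_cdf(k, n, p):
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


def frank_cdf(u, v, alpha):
    if abs(alpha) < 1e-8:          # independence limit as alpha -> 0
        return u * v
    num = math.expm1(-alpha * u) * math.expm1(-alpha * v)
    return -math.log1p(num / math.expm1(-alpha)) / alpha


def conditional_expectation(y_prev, n_t, n_prev, p_t, p_prev, alpha):
    """E(Y_t | Y_{t-1} = y_prev) under binomial margins and a Frank copula."""
    v1 = binom_cdf(y_prev, n_prev, p_prev)
    v0 = binom_cdf(y_prev - 1, n_prev, p_prev)
    total, lower = 0.0, 0.0
    for k in range(n_t + 1):
        u = binom_cdf(k, n_t, p_t)
        # Pr(Y_t <= k, Y_{t-1} = y_prev) from the copula.
        upper = frank_cdf(u, v1, alpha) - frank_cdf(u, v0, alpha)
        total += k * (upper - lower)
        lower = upper
    return total / (v1 - v0)


if __name__ == "__main__":
    beta = (-0.00047, -0.01528, 0.00073, -0.00036)   # estimates from Eq. (33)
    x = (27.0, 200.0, 30.0)   # hypothetical temperature, precipitation, vapor pressure
    p = inv_logit(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
    n = 100                   # hypothetical number of specimens
    for y_prev in (30, 43, 55):
        print(y_prev, conditional_expectation(y_prev, n, n, p, p, 2.19288))
```

With a positive Frank parameter the predicted count rises with the previous count, while in the independence limit the prediction reduces to the binomial mean $n_t \pi_t$.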
Furthermore, the copula-based Markov chain model is compared with other binomial data models, namely the Binomial Logistic Regression (BLR) model for ordinary binomial response data, which does not assume autocorrelation, and GLMM for binomial time

Fig. 1. Time series plot of the monthly proportion of influenza surveillance specimens, humidity, precipitation and temperature in Singapore.
, 20] stated that there are several conditions under which the consistency and asymptotic normality properties of the MLE apply in copula-based Markov chain models, namely:
1. The maximum likelihood estimate $(\hat\beta, \hat\alpha)$ is obtained by solving the optimization problem defined by the score function.
2. All states of the Markov chain communicate with each other (meaning that there are no transient states).
3. The set of values $y$ for which the transition probability given $x$ under $(\beta, \alpha)$ is positive does not depend on $(\beta, \alpha)$.
4. $\ell(\beta, \alpha)$ is continuous and twice continuously differentiable.
5. For $\theta = (\beta, \alpha) \in \Theta$, there exists a neighborhood $N_\theta$ of $\theta$ on which the required moment conditions hold, where $E_\theta$ means expectation assuming that the true parameter value is $\theta$ and $Y_1$ starts with a stationary distribution.
6. $F(y_t \mid x_t, \beta)$ is absolutely continuous with respect to $F(y_t \mid x_t; \beta)$.

Table 1
Copula model parameters for data generation.

Table 2
Mean of estimates, MSEs (within parentheses) for the Clayton copula-based Markov chain model using the logistic regression model.

Table 3
Mean of estimates, MSEs (within parentheses) for the Gumbel copula-based Markov chain model using the logistic regression model.

Table 4
Mean of estimates, MSEs (within parentheses) for the Frank copula-based Markov chain model using the logistic regression model.

Table 5
Parameter estimates for copula-based Markov chain logistic autoregressive regression models fitted to the human influenza data.
Fig. 2. Predicted values using copula-based Markov chain logistic regression models for the human influenza data in Singapore.